I’ve chosen to talk about this one because it involves the most data preprocessing/management steps, and I was in charge of all of them.
About the experiment
Studying language learning and processing mechanisms for singular they
Including pronouns on nametags and in introductions is a common recommendation for creating a more gender-inclusive environment. We know it can affect people’s perception of an environment, but does it also affect their language use?
Participants:
Learned about a set of fictional characters (he/him, she/her, and they/them)
Introduction condition: Varied whether the introductions to the characters explicitly stated their pronouns (This is Alex, who uses they/them pronouns. They…)
Nametag condition: Varied whether the characters’ nametags included pronouns
Speech production task eliciting possessive pronouns (Alex gave the apple to their brother.)
Survey about their demographics, experience with singular they, and attitudes about singular they
About the data
Audio data, transcribed and annotated for which pronouns were produced, plus survey data for each participant
Do the nametag and introduction conditions affect accuracy in producing singular they?
If production accuracy is internally reliable, is it predicted by demographics, language attitude, or language experience measures?
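Before using production accuracy as a participant-level measure, its internal reliability can be checked with a split-half correlation plus a Spearman–Brown correction. A minimal sketch of that idea (hypothetical data structure, not the project’s actual code):

```python
from statistics import mean


def pearson(xs, ys):
    """Plain Pearson correlation, to avoid extra dependencies."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def split_half_reliability(acc_by_participant):
    """Correlate each participant's accuracy on odd vs. even trials,
    then apply the Spearman-Brown correction for halved test length.

    acc_by_participant: {participant_id: [0/1 accuracy per trial]}
    (hypothetical structure)
    """
    odd = [mean(trials[0::2]) for trials in acc_by_participant.values()]
    even = [mean(trials[1::2]) for trials in acc_by_participant.values()]
    r = pearson(odd, even)
    return (2 * r) / (1 + r)
```

An odd/even split is used rather than first/second half so the estimate isn’t distorted by practice or fatigue effects over the session.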
Pipeline overview
Power analysis
Create a data structure matching the design of the proposed experiment, and estimate fixed and random effect sizes from prior experiments.
# get Pronoun * PSA interaction from Exp2 production model
load("r_data/exp2.RData")
exp2_r_effect_size <- exp2_m_prod@model |>
  tidy() |>
  filter(term == "Pronoun=They_HeShe:PSA=GenLang") |>
  pull(estimate) |>
  round(2)
exp2_r_effect_size       # log-odds
exp(exp2_r_effect_size)  # odds ratio

# start with 108 participants each doing 30 trials
exp3_pw_data_struct <- data.frame(
  Participant = rep(as.factor(1:108), each = 30),
  Trial = rep(as.factor(1:30), 108)
)

# Trials are split between 3 Pronoun Pair conditions, which are contrast-coded
# to compare:
# (1) They|HeShe vs HeShe|They + HeShe|SheHe
# (2) HeShe|They vs HeShe|SheHe
exp3_pw_data_struct <- exp3_pw_data_struct |>
  bind_cols("Pronoun" = rep(rep(factor(c("He", "She", "They")), each = 10), 108))
contrasts(exp3_pw_data_struct$Pronoun) <- cbind(
  "_T vs HS" = c(.33, .33, -.66),
  "_H vs S" = c(-.5, .5, 0)
)

# Nametag and Introduction conditions vary in a 2x2 between-P design, and both
# are mean-centered effects coded.
exp3_pw_data_struct <- exp3_pw_data_struct |>
  bind_cols(
    "Nametag" = rep(rep(factor(c(0, 0, 1, 1)), each = 30), 108 / 4),
    "Intro" = rep(rep(factor(c(0, 1, 0, 1)), each = 30), 108 / 4)
  )
contrasts(exp3_pw_data_struct$Nametag) <- cbind("_No_Yes" = c(-.5, .5))
contrasts(exp3_pw_data_struct$Intro) <- cbind("_No_Yes" = c(-.5, .5))

# Item is defined as each unique image-name-pronoun combination. There are 6
# sets of characters, and each list sees 3, making 18 unique characters.
exp3_pw_data_struct <- exp3_pw_data_struct |>
  bind_cols("Character" = rep(as.factor(1:18), each = 30 / 3, 108 / 6))

str(exp3_pw_data_struct)
exp3_pw_data_struct |>
  group_by(Nametag, Intro) |>
  summarise(n_distinct(Participant))

# The closest thing to existing data is the Exp2 (written) production task.
# Since interpreting effect sizes is apparently more complicated for logistic
# regression, let's go with the Exp2 results as a baseline. That's a rough
# estimate of how much harder they/them is to produce than he/him and she/her.
# And let's set the hypothetical Nametag and Introduction effects to be about
# the same size as the PSA. Hopefully that's small enough to be kind of
# conservative with the power analysis, but not aiming for effects too small to
# be practically relevant.
exp2_m_prod_fixed <- exp2_m_prod@model |>
  tidy() |>
  filter(effect == "fixed") |>
  select(term, estimate)
exp2_m_prod_fixed

# Predictions for Exp3 based on ranges from Exp2:
exp3_pw_fixed <- c(
  +0.75, # Intercept                  Medium
  +3.00, # Pronoun: T vs HS           Largest
  -0.10, # Pronoun: H vs S            NS, maybe small
  +0.10, # Nametag                    NS, maybe small
  +0.10, # Introduction               NS, maybe small
  -2.00, # Pronoun: T vs HS * Nametag Same size as PSA interaction
  -0.10, # Pronoun: H vs S * Nametag  NS, maybe small
  -2.00, # Pronoun: T vs HS * Intro   Same size as PSA interaction
  -0.10, # Pronoun: H vs S * Intro    NS, maybe small
  +0.25, # Nametag * Intro            Maybe small
  -2.00, # 3 way T vs HS              Same size as PSA interaction
  -0.10  # 3 way H vs S               NS, maybe small
)

# The model for the Exp2 production task only converged with random intercepts
# by item, and no random effects by participant.
exp2_m_prod_random <- VarCorr(exp2_m_prod@model)

# The model for the Exp1 production task only converged with random intercepts
# and slopes by participant, and no random effects by item.
load("r_data/exp1.RData")
exp1_m_prod_random <- VarCorr(exp1a_m_prod@model)

# So, I'll combine those two as a starting place to estimate the random effects.
# It's possible the actual data won't converge with the maximal random effects
# structure, but for now let's assume it will.
exp3_pw_random <- exp1_m_prod_random
exp3_pw_random[["Item"]] <- exp2_m_prod_random[["Name"]]

# Create model with this data structure, fixed effects, and random effects
exp3_pw_m_108 <- makeGlmer(
  formula = SimAcc ~ Pronoun * Nametag * Intro +
    (Pronoun | Participant) + (1 | Character),
  family = binomial,
  fixef = exp3_pw_fixed,
  VarCorr = exp3_pw_random,
  data = exp3_pw_data_struct
)
summary(exp3_pw_m_108)
Power analysis
Use {simr} (Green and MacLeod 2016) to simulate the power for each effect (Pronoun × Nametag/Intro, Pronoun × Nametag × Intro) at 108, 132, 156, and 180 participants.
# Simulate data
exp3_pw_sim_data <- doSim(exp3_pw_m_108)
exp3_pw_data_struct <- exp3_pw_data_struct |>
  bind_cols("SimAcc" = exp3_pw_sim_data)
summary(exp3_pw_data_struct)

# Code to run simulation:
powerSim(
  exp3_pw_m_108,
  nsim = 1000,
  test = fixed("Pronoun_T vs HS:Nametag_No_Yes", "z")
)

# Then extend model to larger N
exp3_pw_m_132 <- extend(exp3_pw_m_108, along = "Participant", n = 132)

# Load and join results
exp3_pw_results <- bind_rows(
  .id = "sim",
  "2_108" = readRDS("r_data/exp3_power_2way_N108.RDA") |> summary(),
  "2_132" = readRDS("r_data/exp3_power_2way_N132.RDA") |> summary(),
  "2_156" = readRDS("r_data/exp3_power_2way_N156.RDA") |> summary(),
  "2_180" = readRDS("r_data/exp3_power_2way_N180.RDA") |> summary(),
  "3_132" = readRDS("r_data/exp3_power_3way_N132.RDA") |> summary(),
  "3_156" = readRDS("r_data/exp3_power_3way_N156.RDA") |> summary()
) |>
  mutate(
    n_participants = str_sub(sim, 3),
    effect = case_when(
      str_sub(sim, 0, 1) == "2" ~ "Pronoun * Nametag/Intro",
      str_sub(sim, 0, 1) == "3" ~ "Pronoun * Nametag * Intro"
    )
  ) |>
  column_to_rownames(var = "sim")
Power analysis
We determined that 156 participants, each completing 30 trials, would have 0.93 [0.91, 0.94] power at α = .05 to detect the two-way interactions (Pronoun × Nametag/Introduction).
Note that in cognitive psychology, the goal is to have enough statistical power to detect differences between experimental conditions, not necessarily to generalize differences between groups of participants to the entire population.
We can get a decently representative sample of respondents from Prolific, but we did not apply population weights.
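The bracketed interval on a simulated power estimate can be sanity-checked from the simulation counts themselves. A sketch using the Wilson score interval, assuming roughly 930 significant results out of nsim = 1000 (an assumption consistent with the 0.93 estimate; {simr} reports its own binomial CI):

```python
import math


def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a proportion (e.g., simulated power)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# 930 significant simulations out of 1000 -> about [0.91, 0.94]
lo, hi = wilson_ci(930, 1000)
```

This is also a quick way to see why nsim = 1000 is worth the compute: with only 100 simulations the interval around 0.93 would be several times wider.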
Pipeline overview
Audio data (AWS S3)
PCIbex, our experiment platform, sends the audio data to an AWS S3 bucket
It’s most efficient to download the data from S3 once and run the rest of the analyses locally, instead of querying S3 every time
Bash script to download new data; check that an audio file exists as expected for each trial for each participant; then unzip, convert, and sort the audio files
Audio data (AWS S3)
# Options:
#   s  sync data from AWS
#   p  check participant list
#   z  unzip and sort audio files
#   c  run tests on PCIbex output and audio file names
#   t  transcribe
while getopts "spzct" option; do
  case $option in
    s) # Sync audio data from S3
      echo "Getting data from AWS"
      cd ../data/s3/
      aws s3 sync s3://they3 .
      cd ../../preprocessing/
      ;;
    p) # Get list of participants from PCIbex data to update participant list
      echo "Checking audio data to see what needs to be added to the participant list"
      Rscript participant_list.R
      ;;
    z) # Unzip the audio data and convert it to WAV files in dirs for each participant
      echo "Unzipping, converting, and sorting the audio files"
      python s3_to_wav.py
      ;;
    c) # Check output
      echo "Checking the audio file names against the PCIbex data"
      Rscript check_output.R
      ;;
    t) # Transcribe
      echo "Transcribing"
      python transcribe.py
      ;;
  esac
done
Pros: fairly quick; runs locally, so no outside service gets a copy of the identifiable audio data
Cons: does not transcribe speech errors and disfluencies
Transcribe using whisper
import os
import whisper
import pandas as pd
from pathlib import Path


# ---- Helper functions ----- #
def make_transcription_df():
    """Set up dfs for transcription data.

    Returns:
        list of df, one per participant: columns for `participant_id`,
        `prolific_id`, `trial_id`, indexed by `file_path`
    """
    transcriptions = []
    participant_dirs = [
        p for p in audio_dir.iterdir()
        if not p.match("*temp*") and not p.match("*incomplete*")
    ]
    for p in participant_dirs:
        audio_list = [a.stem for a in p.glob('*.wav')]
        trials = [get_trial_info(p.name, a) for a in audio_list]
        df = pd.DataFrame(
            trials,
            columns=['file_path', 'participant_id', 'prolific_id', 'trial_id'],
        )
        df = df.set_index('file_path').sort_values(by='trial_id')
        transcriptions.append(df)
    return transcriptions


def get_trial_info(p_dir, file_name):
    """Get trial info from the name of the audio file.

    Args:
        p_dir (str): dir for participant data
        file_name (str): audio file within participant's data dir

    Returns:
        list: .wav file name (Path), participant ID (str), prolific ID (str),
        and trial ID (str)
    """
    participant_id, prolific_id = p_dir.split('_')
    trial_id = file_name.removeprefix(prolific_id + '_').removesuffix('.wav')
    return [
        audio_dir / p_dir / f"{file_name}.wav",
        participant_id,
        prolific_id,
        trial_id
    ]


def run_whisper_on_participant(df, model):
    """Use whisper to transcribe all trials for one participant.

    Args:
        df (df): structure for transcription data from
            `make_transcription_df()`, which has `participant_id` as the
            first column and is indexed by the path to the audio file
        model (whisper model): loaded whisper model (medium English-only)
    """
    participant_id = df.iloc[0, 0]
    file_path = text_dir / f"{participant_id}_whisper.csv"
    if not os.path.exists(file_path):
        print(participant_id)
        df['text'] = df.index.map(
            lambda t: whisper.transcribe(model, str(t))['text']
        )
        print(df['text'])
        df.to_csv(os.path.join(file_path))


# ---- Main function ----- #
def transcribe_trials():
    """Main function to transcribe .wav files using whisper."""
    transcriptions = make_transcription_df()
    model = whisper.load_model('medium.en')
    for p in transcriptions:
        run_whisper_on_participant(p, model)
    return transcriptions


# ---- Run ----- #
audio_dir = Path('..') / 'data' / 'exp2_audio'
text_dir = Path('..') / 'data' / 'exp2_transcription'
transcribe_trials()
Check transcriptions
An RA listened to the audio and added back in the disfluencies
Coded which pronouns were produced; accuracy was determined by the final pronoun
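The final-pronoun rule can be sketched roughly as follows (a hypothetical helper, not the project’s actual coding script; the real coding was done by an RA and checked with regexes):

```python
import re

# matches standalone nominative or possessive pronoun forms
PRONOUN_RE = re.compile(r"\b(he|his|she|her|they|their)\b")
# map possessives back to the pronoun set they belong to
BASE = {"his": "he", "her": "she", "their": "they"}


def code_trial(transcription, target):
    """Return (pronoun_produced, accuracy). Accuracy is judged on the
    final pronoun, so a self-repair ending on the target counts as accurate."""
    found = PRONOUN_RE.findall(transcription.lower())
    if not found:
        return "none", None  # e.g., an abandoned utterance
    produced = found[-1]
    return produced, int(BASE.get(produced, produced) == target)
```

On the example trials below, a repaired utterance like “to their brother--to her brother” with target she codes as ("her", 1), matching the coded table.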
| participant id | condition | nametag | intro | pronoun pair | target pronoun | target id | distractor pronoun | trial id | transcription | he | his | she | her | they | their | disfluency | multiple pronouns | pronoun produced | accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P202 | nametag | 1 | 0 | HS_HS | he | 10 | she | nametag_list4_critical01_he | Taylor gave the chocolate to his brother. | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | his | 1 |
| P202 | nametag | 1 | 0 | HS_HS | he | 10 | she | nametag_list4_critical02_he | Taylor gave the cherries to their brother. | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | their | 0 |
| P202 | nametag | 1 | 0 | HS_HS | he | 10 | she | nametag_list4_critical03_he | Taylor gave the avocado to his brother. | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | his | 1 |
| P202 | nametag | 1 | 0 | HS_HS | he | 10 | she | nametag_list4_critical04_he | Taylor gave the pumpkin to his brother. | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | his | 1 |
| P202 | nametag | 1 | 0 | HS_HS | he | 10 | she | nametag_list4_critical05_he | Sam gave the bread to his--Taylor gave the bread to his sister. | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | his | 1 |
| P202 | nametag | 1 | 0 | HS_HS | he | 10 | she | nametag_list4_critical06_he | Taylor gave the balloon to his sister. | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | his | 1 |
| P202 | nametag | 1 | 0 | HS_HS | he | 10 | she | nametag_list4_critical07_he | Taylor gave the cards to his sister. | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | his | 1 |
| P202 | nametag | 1 | 0 | HS_HS | he | 10 | she | nametag_list4_critical08_he | Taylor gave the glasses to his sister. | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | his | 1 |
| P202 | nametag | 1 | 0 | HS_T | he | 10 | they | nametag_list4_critical09_he | Taylor gave the corn to his sister. | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | his | 1 |
| P202 | nametag | 1 | 0 | HS_T | he | 10 | they | nametag_list4_critical10_he | Taylor gave the kiwi to his brother. | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | his | 1 |
| P202 | nametag | 1 | 0 | HS_T | he | 10 | they | nametag_list4_critical11_he | Taylor gave his brother grapes. | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | his | 1 |
| P202 | nametag | 1 | 0 | HS_T | he | 10 | they | nametag_list4_critical12_he | Taylor gave the pear to his brother. | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | his | 1 |
| P202 | nametag | 1 | 0 | HS_T | he | 10 | they | nametag_list4_critical13_he | Taylor gave the wa--orange juice-- | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | none | NA |
| P202 | nametag | 1 | 0 | HS_T | he | 10 | they | nametag_list4_critical14_he | Taylor gave the yellow cap to his sister. | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | his | 1 |
| P202 | nametag | 1 | 0 | HS_T | he | 10 | they | nametag_list4_critical15_he | Taylor gave the scissors to his sister. | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | his | 1 |
| P202 | nametag | 1 | 0 | HS_T | he | 10 | they | nametag_list4_critical16_he | Taylor gave the suitcase to his sister. | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | his | 1 |
| P202 | nametag | 1 | 0 | HS_HS | she | 11 | he | nametag_list4_critical17_she | Jordan gave the spoon to her brother. | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | her | 1 |
| P202 | nametag | 1 | 0 | HS_HS | she | 11 | he | nametag_list4_critical18_she | Jordan gave the broccoli to her sister. | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | her | 1 |
| P202 | nametag | 1 | 0 | HS_HS | she | 11 | he | nametag_list4_critical19_she | Jordan gave her brother an egg. | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | her | 1 |
| P202 | nametag | 1 | 0 | HS_HS | she | 11 | he | nametag_list4_critical20_she | Jordan gave the strawberry to her brother. | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | her | 1 |
| P202 | nametag | 1 | 0 | HS_HS | she | 11 | he | nametag_list4_critical21_she | Jordan gave the pineapple to her sister. | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | her | 1 |
| P202 | nametag | 1 | 0 | HS_HS | she | 11 | he | nametag_list4_critical22_she | Jordan gave the bucket to her sister. | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | her | 1 |
| P202 | nametag | 1 | 0 | HS_HS | she | 11 | he | nametag_list4_critical23_she | Jordan gave the watch to her sister. | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | her | 1 |
| P202 | nametag | 1 | 0 | HS_HS | she | 11 | he | nametag_list4_critical24_she | Jordan gave the guitar to her sister. | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | her | 1 |
| P202 | nametag | 1 | 0 | HS_T | she | 11 | they | nametag_list4_critical25_she | Jordan gave the banana to her sister. | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | her | 1 |
| P202 | nametag | 1 | 0 | HS_T | she | 11 | they | nametag_list4_critical26_she | Jordan gave the bacon to their brother--to her brother. | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | her | 1 |
| P202 | nametag | 1 | 0 | HS_T | she | 11 | they | nametag_list4_critical27_she | Jordan gave the ice cream to her brother. | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | her | 1 |
| P202 | nametag | 1 | 0 | HS_T | she | 11 | they | nametag_list4_critical28_she | Jorin gave the carrot to her brother. | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | her | 1 |
| P202 | nametag | 1 | 0 | HS_T | she | 11 | they | nametag_list4_critical29_she | Jordan gave the lemon to her brother. | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | her | 1 |
| P202 | nametag | 1 | 0 | HS_T | she | 11 | they | nametag_list4_critical30_she | Jordan gave the stuffed animal to her sister. | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | her | 1 |
| P202 | nametag | 1 | 0 | HS_T | she | 11 | they | nametag_list4_critical31_she | Jordan gave the rose to her sister. | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | her | 1 |
| P202 | nametag | 1 | 0 | HS_T | she | 11 | they | nametag_list4_critical32_she | Jordan gave the water bottle to her brother. | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | her | 1 |
| P202 | nametag | 1 | 0 | T_HS | they | 12 | she | nametag_list4_critical33_they | Sam gave the pizza to their brother. | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | their | 1 |
| P202 | nametag | 1 | 0 | T_HS | they | 12 | she | nametag_list4_critical34_they | Sam gave the plate to their brother. | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | their | 1 |
| P202 | nametag | 1 | 0 | T_HS | they | 12 | she | nametag_list4_critical35_they | Sam gave the orange to her brother. | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | her | 0 |
| P202 | nametag | 1 | 0 | T_HS | they | 12 | she | nametag_list4_critical36_they | Sam gave the potato to her--their brother. | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | their | 1 |
| P202 | nametag | 1 | 0 | T_HS | they | 12 | she | nametag_list4_critical37_they | Sam gave the apple to their sister. | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | their | 1 |
| P202 | nametag | 1 | 0 | T_HS | they | 12 | she | nametag_list4_critical38_they | Sam gave the brown bag to her sister. | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | her | 0 |
| P202 | nametag | 1 | 0 | T_HS | they | 12 | she | nametag_list4_critical39_they | Sam gave the teddy bear to their sister. | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | their | 1 |
| P202 | nametag | 1 | 0 | T_HS | they | 12 | she | nametag_list4_critical40_they | Sam gave the paintbrush to their sister. | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | their | 1 |
| P202 | nametag | 1 | 0 | T_HS | they | 12 | he | nametag_list4_critical41_they | Sam gave the mushroom to their sister. | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | their | 1 |
| P202 | nametag | 1 | 0 | T_HS | they | 12 | he | nametag_list4_critical42_they | Sam gave the onion to their brother. | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | their | 1 |
| P202 | nametag | 1 | 0 | T_HS | they | 12 | he | nametag_list4_critical43_they | Sam gave their brother a watermelon. | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | their | 1 |
| P202 | nametag | 1 | 0 | T_HS | they | 12 | he | nametag_list4_critical44_they | Sa--Sam gave the knife to her brother--to their brother. | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | their | 1 |
| P202 | nametag | 1 | 0 | T_HS | they | 12 | he | nametag_list4_critical45_they | Sam gave the cookie to their brother. | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | their | 1 |
| P202 | nametag | 1 | 0 | T_HS | they | 12 | he | nametag_list4_critical46_they | Sam gave the tomato to their sister. | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | their | 1 |
| P202 | nametag | 1 | 0 | T_HS | they | 12 | he | nametag_list4_critical47_they | Sam gave the pencil to her sister--their sister. | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | their | 1 |
| P202 | nametag | 1 | 0 | T_HS | they | 12 | he | nametag_list4_critical48_they | Sam gave the soccer ball to their sister. | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | their | 1 |
Check transcriptions
Tests check the manual coding against regex matches and check each participant for completeness
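The kind of consistency check involved can be sketched like this (hypothetical column names mirroring the coded table above; the actual tests live in an R script):

```python
PRONOUN_COLS = ("he", "his", "she", "her", "they", "their")


def validate_trial(row):
    """Flag rows where the per-pronoun indicator columns disagree with the
    summary columns. `row` is a dict keyed by (hypothetical) column names."""
    errors = []
    n_flagged = sum(row[p] for p in PRONOUN_COLS)
    if row["pronoun_produced"] == "none":
        if n_flagged != 0:
            errors.append("'none' recorded but a pronoun column is flagged")
    elif row[row["pronoun_produced"]] != 1:
        errors.append("produced pronoun not flagged in its own column")
    if row["multiple_pronouns"] != int(n_flagged > 1):
        errors.append("multiple-pronouns flag inconsistent with counts")
    return errors
```

Running a check like this over every coded row catches typos in the hand-coded columns before they silently feed into the accuracy analysis.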
Survey data is written in a log-file format that needs to be parsed to remove irrelevant data, select the participant/condition-level data that I recorded at the beginning, and select the trial-level data that I recorded with each trial:
Preprocess
R code to wrangle survey data:
library(tidyverse)
library(janitor)
library(readxl)

# MAIN DF----
## Read PCIbex output----
d_survey <- list.files(path = "data/exp2_PCIbex/", full.names = TRUE) |>
  map_df(~ read.csv(., header = FALSE, fill = TRUE, col.names = paste("V", 1:26))) |>
  filter(str_detect(V.1, "#") == FALSE) |> # drop PennController comments
  select(V.13, V.6, V.9, V.15, V.10, V.11) |> # drop PennController extra cols
  rename( # name PennController output columns
    trial_type = V.6,   # trial label in PCIbex
    trial_part = V.9,   # type of trial data
    parameter = V.10,
    response = V.11,
    prolific_id = V.13, # first trial variable saved is prolific_id
    trial_item = V.15
  ) |>
  filter(str_detect( # get demographics and familiarity questions
    trial_type, "demographics|sentences|they|transphobia"
  )) |>
  # Remove status update rows for them
  filter(parameter != "_Header_" & parameter != "_Trial_") |>
  filter(parameter != "First" & parameter != "Unselect") |>
  filter(parameter != "Status" & parameter != "Filename") |>
  filter(!( # remove rows that indicate last item selected in check box
    parameter == "Choice" & (
      trial_part == "enter_they" |
        trial_part == "enter_trans" |
        trial_part == "enter_sexuality" |
        trial_part == "enter_race"
    )
  )) |>
  filter(!(parameter == "Final" & response == "")) # write-in box empty

## Match to Participant ID----
participant_list <- "data/participant_list.xlsx" |>
  read_xlsx(sheet = 1, range = cell_cols(1:10)) |>
  clean_names() |>
  select(ends_with("id"), condition) |>
  filter(!is.na(participant_id)) |>
  mutate(across(everything(), as.factor))

d_survey <- d_survey |>
  left_join(participant_list, by = "prolific_id") |>
  relocate(participant_id, .before = 1) |>
  select(-prolific_id, -condition)

## Exclusions----
d_survey <- d_survey |>
  filter(str_detect(participant_id, "P")) |>
  droplevels()

## Question categories & items----
d_survey <- d_survey |>
  mutate(
    .after = participant_id,
    category = case_when(
      trial_type == "rate_sentences" ~ "Sentence Naturalness Ratings",
      trial_type == "transphobia_scale" ~ "Transphobia Scale",
      trial_part == "enter_they" ~ "Familiarity With They/Them Pronouns",
      str_detect(trial_part, "intro|nametag") ~ "Familiarity With Pronoun-Sharing Practices",
      str_detect(trial_part, "age") ~ "Age",
      str_detect(trial_part, "gender") ~ "Gender",
      str_detect(trial_part, "trans") ~ "Transgender & Gender-Diverse",
      str_detect(trial_part, "sexuality") ~ "Sexuality",
      str_detect(trial_part, "race") ~ "Race/Ethnicity",
      str_detect(trial_part, "english") ~ "English Experience",
      str_detect(trial_part, "ed") ~ "Education"
    ) |> as.factor(),
    item = case_when(
      trial_part == "enter_intro_others" ~ "Intros: Others",
      trial_part == "enter_intro_self" ~ "Intros: Self",
      trial_part == "enter_nametags_others" ~ "Nametags: Others",
      trial_part == "enter_nametags_self" ~ "Nametags: Self",
      str_detect(parameter, "for myself") ~ "Myself",
      str_detect(parameter, "am close to") ~ "Close To",
      str_detect(parameter, "have met") ~ "Have Met",
      str_detect(parameter, "have not met") ~ "Heard About",
      str_detect(parameter, "had not heard") ~ "Not Heard About",
      trial_item != "" ~ trial_item,
      str_detect(category, "Trans|Sexuality|Race") ~ parameter,
      str_detect(category, "Ed|Eng") ~ response,
      str_detect(category, "Gender") ~ category
    ) |>
      recode_factor(
        "generic" = "Generic",
        "each" = "Each",
        "every" = "Every",
        "neu" = "Neutral\nName",
        "fem" = "Fem\nName",
        "masc" = "Masc\nName"
      ) |>
      str_replace_all(c(
        "%2C" = ",", "2 year" = "2-year", "4 year" = "4-year", "term:" = "term"
      )) |>
      as.factor()
  ) |>
  select(-starts_with("trial"), -parameter)

## Response types----
d_survey$response <- d_survey$response |>
  str_replace_all(c("%2C" = ",", "2 year" = "2-year", "4 year" = "4-year"))

d_survey <- d_survey |>
  mutate(
    response_num = case_when(
      !is.na(as.numeric(response)) ~ as.numeric(response),
      is.na(as.numeric(response)) ~ NA
    ),
    response_bool = case_when(
      response == "checked" ~ TRUE,
      response == "unchecked" ~ FALSE,
      response != "checked" & response != "unchecked" ~ NA
    ),
    response_cat = case_when(
      is.na(response_num) & is.na(response_bool) ~ response,
      .default = NA
    ),
    item = case_when(
      !is.na(item) ~ item,
      category == "Age" & response_num <= 24 ~ "18–24",
      category == "Age" & response_num >= 25 & response_num <= 34 ~ "25–34",
      category == "Age" & response_num >= 35 & response_num <= 44 ~ "35–44",
      category == "Age" & response_num >= 45 & response_num <= 54 ~ "45–54",
      category == "Age" & response_num >= 55 & response_num <= 64 ~ "55–64",
      category == "Age" & response_num >= 65 & response_num <= 74 ~ "65–74",
      response_num >= 75 ~ "75+"
    )
  ) |>
  mutate(across(where(is.character), as.factor)) |>
  select(-response) |>
  filter(!is.na(item))

## Recode gender----
d_survey |>
  filter(category == "Gender") |>
  pull(response_cat) |>
  droplevels() |>
  unique()

# Group similar responses
d_survey$response_cat <- d_survey$response_cat |>
  recode_factor(
    "female" = "Woman", "f" = "Woman", "Femal" = "Woman", "Female" = "Woman",
    "FEMALE" = "Woman", "woman" = "Woman", "WOMAN" = "Woman", "Female " = "Woman",
    "female/woman" = "Woman", "Female/Woman" = "Woman", "cis woman" = "Woman",
    "Cisfemale" = "Woman", "cisgender woman" = "Woman", "transwoman" = "Woman",
    "male" = "Man", "MALE" = "Man", "Male" = "Man", "Male " = "Man", "Man" = "Man",
    "cis-gender male" = "Man", "cis male" = "Man", "TRANS MAN" = "Man",
    "Transgender Man" = "Man", "Nonbinary" = "Nonbinary spectrum",
    "nonbinary" = "Nonbinary spectrum", "Non-binary" = "Nonbinary spectrum",
    "non binary" = "Nonbinary spectrum", "Transfem nonbinary" = "Nonbinary spectrum",
    "Male and nonbinary" = "Nonbinary spectrum", "she/they" = "Nonbinary spectrum",
    "genderfluid" = "Nonbinary spectrum", "Genderfluid" = "Nonbinary spectrum",
    "questioning" = "Questioning"
  )

d_survey |>
  filter(category == "Gender") |>
  pull(response_cat) |>
  droplevels() |>
  unique()

## Write-in responses----
d_survey$item <- d_survey$item |>
  recode_factor("Final" = "I use a different term")

# Just keep one row for yes to diff term + write-in box
d_survey <- d_survey |>
  mutate(
    response_bool = case_when(
      item == "I use a different term" & !is.na(response_cat) ~ TRUE,
      .default = response_bool
    )
  ) |>
  filter(!(
    item == "I use a different term" & is.na(response_cat) & response_bool == TRUE
  )) |>
  filter(response_cat != "Normal" | is.na(response_cat)) # asshole response that also checked straight

## Add missing data----
missing <- tibble(
  participant_id = c(
    rep("P277", 31), rep("P278", 31), rep("P419", 31),
    rep("P482", 31), rep("P502", 31)
  ),
  category = rep(
    c(
      "Age", "Education", "English Experience",
      rep("Familiarity With Pronoun-Sharing Practices", 4),
      rep("Familiarity With They/Them Pronouns", 5),
      "Gender", "Race/Ethnicity", "Sexuality",
      rep("Sentence Naturalness Ratings", 6),
      "Transgender & Gender-Diverse",
      rep("Transphobia Scale", 9)
    ),
    5
  ),
  item = rep(
    c(
      rep("Missing Data", 3),
      "Intros: Others", "Intros: Self", "Nametags: Others", "Nametags: Self",
      "Myself", "Close To", "Have Met", "Heard About", "Not Heard About",
      "Missing Data", "Missing Data", "Missing Data",
      "Masc Name", "Fem Name", "Neutral Name", "Generic", "Every", "Each",
      "Missing Data",
      paste(
        "I am uncomfortable around people who don’t conform to traditional",
        "gender roles, e.g., aggressive women or emotional men."
      ),
      "I avoid people on the street whose gender is unclear to me.",
      paste(
        "I think there is something wrong with a person who says that they",
        "are neither a man nor a woman."
      ),
      paste(
        "I would be upset if someone I’d known a long time revealed to me",
        "that they used to be another gender."
      ),
      paste(
        "When I meet someone, it is important for me to be able to identify",
        "them as a man or a woman."
      ),
      "I believe that a person can never change their gender.",
      paste(
        "A person’s genitalia define what gender they are, e.g., a penis",
        "defines a person as being a man, a vagina defines a person as being a",
        "woman."
      ),
      paste(
        "I don’t like it when someone is flirting with me, and I can’t tell",
        "if they are a man or a woman."
      ),
      "I believe that the male/female dichotomy is natural."
    ),
    5
  )
)

missing <- missing |>
  mutate(
    response_num = as.numeric(NA),
    response_bool = NA,
    response_cat = as.character(NA)
  ) |>
  mutate(across(where(is.character), as.factor))

d_survey <- bind_rows(d_survey, missing) |>
  distinct() |>
  mutate(participant_id = as.factor(as.character(participant_id)))

# Aggregates----
## Age----
agg_age <- d_survey |>
  filter(category == "Age") |>
  select(participant_id, response_num) |>
  arrange(participant_id) |>
  rename(age = response_num)

## TGD----
agg_TGD <- d_survey |>
  filter(category == "Transgender & Gender-Diverse" & response_bool == TRUE) |>
  select(participant_id, item, response_bool) |>
  mutate(
    response_coded = case_when(
      str_detect(item, "is different") ~ 1,
      item == "I consider myself transgender" ~ 1,
      .default = 0
    )
  ) |>
  summarise(
    .by = participant_id,
    TGD = sum(response_coded) |> recode(`2` = 1)
  )

## LGBQ----
agg_LGBQ <- d_survey |>
  filter(category == "Sexuality" & response_bool == TRUE) |>
  select(participant_id, item, response_bool) |>
  mutate(response_coded = ifelse(str_detect(item, "As|Bi|Gay|Queer"), 1, 0)) |>
  summarise(
    .by = participant_id,
    LGBQ = sum(response_coded) |> recode(`2` = 1, `3` = 1)
  )

## Transphobia scale----
agg_TS <- d_survey |>
  filter(category == "Transphobia Scale") |>
  select(participant_id, response_num) |>
  mutate(response_coded = response_num - 1) |>
  summarise(.by = participant_id, gender_beliefs = sum(response_coded))

## Sentence ratings----
agg_ratings <- d_survey |>
  filter(category == "Sentence Naturalness Ratings") |>
  select(participant_id, item, response_num) |>
  mutate(
    type = ifelse(str_detect(item, "Name"), "rating_name", "rating_indefinite")
  ) |>
  summarise(.by = c(participant_id, type), rating = mean(response_num)) |>
  pivot_wider(names_from = "type", values_from = "rating")

## Familiarity with using they/them----
agg_they <- d_survey |>
  filter(str_detect(category, "They/Them") & response_bool == TRUE) |>
  select(participant_id, item, response_bool) |>
  pivot_wider(names_from = item, values_from = response_bool) |>
  mutate(Myself_Close = ifelse(Myself == TRUE & `Close To` == TRUE, TRUE, NA)) |>
  mutate(
    .keep = c("unused"),
    familiarity = case_when(
      Myself_Close == TRUE | `Close To` == TRUE | Myself == TRUE ~ 3,
      `Have Met` == TRUE ~ 2,
      `Heard About` == TRUE | `Not Heard About` == TRUE ~ 1
    )
  )

## Familiarity with pronoun-sharing----
agg_sharing <- d_survey |>
  filter(str_detect(category, "Sharing")) |>
  select(participant_id, item, response_cat) |>
  mutate(
    response_coded = case_when(
      response_cat == "Always" | response_cat == "All" ~ 5,
      response_cat == "Usually" | response_cat == "Most" ~ 4,
      response_cat == "Sometimes" | response_cat == "Some" ~ 3,
      response_cat == "Rarely" | response_cat == "A few" ~ 2,
      str_detect(response_cat, "prefer not to") ~ 1,
      str_detect(response_cat, "not heard") ~ 0,
      response_cat == "None" ~ 0
    )
  ) |>
  summarise(.by = participant_id, sharing = sum(response_coded))

## Merge----
d_agg <- participant_list |>
  select(participant_id, condition) |>
  left_join(agg_age, by = "participant_id") |>
  left_join(agg_LGBQ, by = "participant_id") |>
  left_join(agg_TGD, by = "participant_id") |>
  left_join(agg_ratings, by = "participant_id") |>
  left_join(agg_sharing, by = "participant_id") |>
  left_join(agg_they, by = "participant_id") |>
  left_join(agg_TS, by = "participant_id")

# Demographics table----
d_demographics <- d_survey |>
  filter(category %in% c(
    "Age", "Gender", "Transgender & Gender-Diverse", "Sexuality",
    "Race/Ethnicity", "Education", "English Experience"
  )) |>
  filter(response_bool == TRUE | is.na(response_bool)) |>
  mutate(
    group = case_when(
      category == "Gender" ~ as.character(response_cat),
      category == "English Experience" ~ as.character(response_cat),
      category == "Education" ~ as.character(response_cat),
      category == "Sexuality" ~ as.character(item),
      category == "Race/Ethnicity" ~ as.character(item),
      category == "Transgender & Gender-Diverse" ~ as.character(item),
      category == "Age" ~ as.character(item)
    )
  ) |>
  select(-(starts_with("response")), -item) |>
  mutate(
    group = group |>
      replace_na("Prefer not to answer / Missing data") |>
      recode_factor(
        "Prefer not to answer" = "Prefer not to answer / Missing data",
        "prefer not to answer" = "Prefer not to answer / Missing data",
        "Missing Data" = "Prefer not to answer / Missing data"
      )
  ) |>
  summarise(.by = c(category, group), total = n_distinct(participant_id))

dem_totals <- d_demographics |>
  group_by() |>
  summarise(.by = category, total = sum(total)) |>
  mutate(group = "Total")

d_demographics <- d_demographics |>
  bind_rows(dem_totals) |>
  arrange(category, group)

# Export----
write_csv(d_survey, "data/exp2_survey.csv")
write_csv(d_agg, "data/exp2_participant_covariates.csv")
write_csv(d_demographics, "data/exp2_demographics.csv")
Preprocess
Survey questions parsed:
| ParticipantID | Condition | List | Category | Item | Response_Num | Response_Bool | Response_Cat |
|---|---|---|---|---|---|---|---|
| 3_001 | both | 1 | Sentence Naturalness Ratings | Masc Name | 1 | NA | NA |
| 3_001 | both | 1 | Sentence Naturalness Ratings | Fem Name | 1 | NA | NA |
| 3_001 | both | 1 | Sentence Naturalness Ratings | Neutral Name | 1 | NA | NA |
| 3_001 | both | 1 | Sentence Naturalness Ratings | Every | 5 | NA | NA |
| 3_001 | both | 1 | Sentence Naturalness Ratings | Generic | 1 | NA | NA |
| 3_001 | both | 1 | Sentence Naturalness Ratings | Each | 6 | NA | NA |
| 3_001 | both | 1 | Familiarity With They/Them Pronouns | Myself | NA | FALSE | NA |
| 3_001 | both | 1 | Familiarity With They/Them Pronouns | Close To | NA | FALSE | NA |
| 3_001 | both | 1 | Familiarity With They/Them Pronouns | Have Met | NA | FALSE | NA |
| 3_001 | both | 1 | Familiarity With They/Them Pronouns | Heard About | NA | TRUE | NA |
| 3_001 | both | 1 | Familiarity With They/Them Pronouns | Not Heard About | NA | FALSE | NA |
| 3_001 | both | 1 | Familiarity With Pronoun-Sharing Practices | Intros: Others | NA | NA | Some |
| 3_001 | both | 1 | Familiarity With Pronoun-Sharing Practices | Intros: Self | NA | NA | Never, because I prefer not to |
| 3_001 | both | 1 | Familiarity With Pronoun-Sharing Practices | Nametags: Others | NA | NA | Most |
| 3_001 | both | 1 | Familiarity With Pronoun-Sharing Practices | Nametags: Self | NA | NA | Never, because I prefer not to |
| 3_001 | both | 1 | Transphobia Scale | I am uncomfortable around people who don't conform to traditional gender roles, e.g., aggressive women or emotional men. | 2 | NA | NA |
| 3_001 | both | 1 | Transphobia Scale | I avoid people on the street whose gender is unclear to me. | 2 | NA | NA |
| 3_001 | both | 1 | Transphobia Scale | I think there is something wrong with a person who says that they are neither a man nor a woman. | 6 | NA | NA |
| 3_001 | both | 1 | Transphobia Scale | I would be upset if someone I'd known a long time revealed to me that they used to be another gender. | 5 | NA | NA |
| 3_001 | both | 1 | Transphobia Scale | When I meet someone, it is important for me to be able to identify them as a man or a woman. | 7 | NA | NA |
| 3_001 | both | 1 | Transphobia Scale | I believe that a person can never change their gender. | 7 | NA | NA |
| 3_001 | both | 1 | Transphobia Scale | A person's genitalia define what gender they are, e.g., a penis defines a person as being a man, a vagina defines a person as being a woman. | 7 | NA | NA |
| 3_001 | both | 1 | Transphobia Scale | I don't like it when someone is flirting with me, and I can't tell if they are a man or a woman. | 7 | NA | NA |
| 3_001 | both | 1 | Transphobia Scale | I believe that the male/female dichotomy is natural. | 7 | NA | NA |
| 3_001 | both | 1 | Age | 45-54 | 53 | NA | NA |
| 3_001 | both | 1 | Gender | Gender | NA | NA | Male |
| 3_001 | both | 1 | Transgender & Gender-Diverse | My gender is the same as what was written on my original birth certificate | NA | TRUE | NA |
| 3_001 | both | 1 | Transgender & Gender-Diverse | My gender is different than what was written on my original birth certificate | NA | FALSE | NA |
| 3_001 | both | 1 | Transgender & Gender-Diverse | I consider myself cisgender | NA | FALSE | NA |
| 3_001 | both | 1 | Transgender & Gender-Diverse | I consider myself transgender | NA | FALSE | NA |
| 3_001 | both | 1 | Transgender & Gender-Diverse | I don't consider myself cisgender or transgender | NA | FALSE | NA |
| 3_001 | both | 1 | Transgender & Gender-Diverse | Prefer not to answer | NA | FALSE | NA |
| 3_001 | both | 1 | Sexuality | Asexual | NA | FALSE | NA |
| 3_001 | both | 1 | Sexuality | Bisexual/Pansexual | NA | FALSE | NA |
| 3_001 | both | 1 | Sexuality | Gay/Lesbian | NA | FALSE | NA |
| 3_001 | both | 1 | Sexuality | Heterosexual/Straight | NA | TRUE | NA |
| 3_001 | both | 1 | Sexuality | Queer | NA | FALSE | NA |
| 3_001 | both | 1 | Sexuality | Questioning | NA | FALSE | NA |
| 3_001 | both | 1 | Sexuality | Prefer not to answer | NA | FALSE | NA |
| 3_001 | both | 1 | Sexuality | I use a different term | NA | FALSE | NA |
| 3_001 | both | 1 | Education | Professional degree | NA | NA | Professional degree |
| 3_001 | both | 1 | English Experience | Native (learned from birth) | NA | NA | Native (learned from birth) |
| 3_001 | both | 1 | Race/Ethnicity | American Indian or Alaska Native | NA | FALSE | NA |
| 3_001 | both | 1 | Race/Ethnicity | Asian | NA | FALSE | NA |
| 3_001 | both | 1 | Race/Ethnicity | Black, African American, or African | NA | FALSE | NA |
| 3_001 | both | 1 | Race/Ethnicity | Hispanic, Latino, or Spanish | NA | FALSE | NA |
| 3_001 | both | 1 | Race/Ethnicity | Middle Eastern or North African | NA | FALSE | NA |
| 3_001 | both | 1 | Race/Ethnicity | Native Hawaiian or Pacific Islander | NA | FALSE | NA |
| 3_001 | both | 1 | Race/Ethnicity | White | NA | TRUE | NA |
| 3_001 | both | 1 | Race/Ethnicity | Prefer not to answer | NA | FALSE | NA |
| 3_001 | both | 1 | Race/Ethnicity | I use a different term | NA | FALSE | NA |
Preprocess
Survey questions coded into potential covariates:
| ParticipantID | Condition | Age | LGBQ | TGD | Rating_Generic | Rating_Name | Sharing | UseThey | GenderBeliefs |
|---|---|---|---|---|---|---|---|---|---|
| 3_001 | both | 53 | 0 | 0 | 4.000000 | 1.000000 | 9 | 1 | 41 |
| 3_002 | nametag | 21 | 1 | 0 | 6.666667 | 7.000000 | 16 | 3 | 0 |
| 3_003 | nametag | 43 | 0 | 0 | 5.666667 | 6.333333 | 4 | 2 | 12 |
| 3_004 | nametag | 50 | 0 | 0 | 5.666667 | 4.000000 | 6 | 1 | 22 |
| 3_005 | intro | 37 | 0 | 0 | 6.333333 | 1.333333 | 4 | 2 | 19 |
| 3_006 | intro | 35 | 1 | 0 | 6.000000 | 6.000000 | 17 | 3 | 0 |
| 3_007 | intro | 32 | 0 | 0 | 4.666667 | 5.333333 | 9 | 2 | 15 |
| 3_008 | intro | 48 | 0 | 0 | 5.666667 | 3.000000 | 9 | 2 | 8 |
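The coding step itself isn't shown on the slide. As a rough sketch of the idea (the object `survey_long` and the column names `Question`, `Item`, `Numeric`, and `Logical` are assumptions based on the raw table above, and the real scoring scheme, e.g. any reverse-coded Transphobia Scale items, may differ):

```r
library(dplyr)

# Hypothetical long-format survey data: one row per question item, with the
# response stored in Numeric, Logical, or Text depending on question type.
# Column names and scoring rules here are assumptions, not the study's code.
survey_covariates <- survey_long |>
  group_by(ParticipantID, Condition) |>
  summarise(
    # Sum the Transphobia Scale ratings into a gender-beliefs score
    GenderBeliefs = sum(Numeric[Question == "Transphobia Scale"], na.rm = TRUE),
    # Flag participants who checked any non-straight sexuality option
    LGBQ = as.integer(any(Logical[Question == "Sexuality" &
                                    !Item %in% c("Heterosexual/Straight",
                                                 "Prefer not to answer")],
                          na.rm = TRUE)),
    # Numeric age where reported
    Age = first(na.omit(Numeric[Question == "Age"])),
    .groups = "drop"
  )
```

The point is the shape of the transformation: many survey rows per participant collapse to one covariate row per participant, which can then be joined to the trial-level data.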
Pipeline overview
Merge data
It’s easy to merge the participant-level survey data with the trial-level pronoun data by joining on the participant ID
For bigger projects, I write custom functions to load and set up the data, ensuring that the output is always identical
Merge data
```r
# Loads accuracy data, sets up contrast coding and scaling ----
exp3_load_data_acc <- function() {
  library(dplyr)
  library(forcats)
  library(scales)

  d <- read.csv("data/exp3_pronouns.csv", stringsAsFactors = TRUE) |>
    select(ParticipantID, Nametag, Intro, Pronoun_Pair, T_ID, Accuracy)

  # Remove trials with no pronouns
  d <- d |> filter(!is.na(Accuracy))

  # Mean-center effects code Nametag and Intro
  d$Nametag <- factor(d$Nametag, labels = c("-Nametag", "+Nametag"))
  contrasts(d$Nametag) <- cbind(c(-.5, .5))
  d$Intro <- factor(d$Intro, labels = c("-Intro", "+Intro"))
  contrasts(d$Intro) <- cbind(c(-.5, .5))

  # Orthogonal Helmert contrast codes for Pronoun Pair
  d <- d |> rename("Pronoun" = "Pronoun_Pair")
  d$Pronoun <- d$Pronoun |>
    fct_relevel("T_HS", after = 0) |>
    fct_relevel("HS_T", after = 1)
  contrasts(d$Pronoun) <- cbind(
    "Target" = c(-.66, +.33, +.33),
    "Dist"   = c(0, -.50, +.50)
  )

  # Add dummy-coded factor for They vs He/She
  d <- d |> mutate(Pronoun_They0 = ifelse(Pronoun == "T_HS", 0, 1))

  # Dummy code Nametag and Intro
  d <- d |> mutate(
    Nametag_Yes0 = ifelse(Nametag == "+Nametag", 0, 1),
    Nametag_No0  = ifelse(Nametag == "-Nametag", 0, 1),
    Intro_Yes0   = ifelse(Intro == "+Intro", 0, 1),
    Intro_No0    = ifelse(Intro == "-Intro", 0, 1)
  )

  # Scale character (1-18)
  d <- d |> mutate(.keep = c("unused"), Character = rescale(T_ID, c(-0.5, 0.5)))

  # Subset and order
  d <- d |> select(
    ParticipantID, Nametag, Nametag_Yes0, Nametag_No0,
    Intro, Intro_Yes0, Intro_No0,
    Pronoun, Pronoun_They0, Character, Accuracy
  )
  return(d)
}

# Adds participant covariates to accuracy data, mean-centers + rescales them ----
exp3_load_data_subj <- function() {
  # Join participant covariates to accuracy df
  d <- left_join(
    exp3_load_data_acc(),
    read.csv("data/exp3_participant-covariates.csv", stringsAsFactors = TRUE),
    by = "ParticipantID"
  ) |>
    rename("Familiarity" = "UseThey", "Rating" = "Rating_Name")

  # Remove participants with no pronouns (1) or no survey data (3)
  d <- d |> filter(!is.na(Age))
  d$ParticipantID <- droplevels(d$ParticipantID)

  # Scale THEN mean-center (on accuracy df)
  d <- d |> mutate(
    Age_C           = scale(Age / 80, center = TRUE, scale = FALSE),
    Familiarity_C   = scale(Familiarity / 2, center = TRUE, scale = FALSE),
    GenderBeliefs_C = scale(GenderBeliefs / 60, center = TRUE, scale = FALSE),
    LGBTQ_C         = LGBQ - 0.50,
    Rating_C        = scale(Rating / 6, center = TRUE, scale = FALSE),
    Sharing_C       = scale(Sharing / 20, center = TRUE, scale = FALSE)
  )

  # Effects-code LGBTQ
  d <- d |> mutate(LGBTQ_Fct = as.factor(LGBTQ_C))
  contrasts(d$LGBTQ_Fct) <- cbind(c(-0.5, +0.5))

  # Subset and order
  d <- d |> select(
    ParticipantID, Nametag, Intro, Pronoun, Character, Accuracy,
    Age, Age_C, Familiarity, Familiarity_C,
    GenderBeliefs, GenderBeliefs_C, LGBTQ_C, LGBTQ_Fct,
    Rating, Rating_C, Sharing, Sharing_C
  )
  return(d)
}
```
Estimate internal reliability
Before we can (or should) use the survey questions as predictors of production accuracy, we need to establish their internal reliability (Hedge, Powell, and Sumner 2017)
Used a Bayesian mixed-effects model approach comparing the by-participant slopes in each half of the data (Staub 2021)
Estimates of the relative accuracy of they/them compared to he/him + she/her for each participant were strongly correlated between halves of the data, r = 0.97 [0.90, 1.00]
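A raw split-half correlation offers a quick sanity check on the same idea. This is only a sketch (not the reported analysis), assuming the `exp3_load_data_acc()` helper from the merge step; raw split-half correlations are attenuated by trial-level noise, which is one reason the model-based estimate is preferred:

```r
library(dplyr)
library(tidyr)

# Per-participant singular-they accuracy in odd vs. even trials
split_half <- exp3_load_data_acc() |>
  filter(Pronoun == "T_HS") |>                       # they/them trials only
  group_by(ParticipantID) |>
  mutate(Half = if_else(row_number() %% 2 == 0, "even", "odd")) |>
  group_by(ParticipantID, Half) |>
  summarise(Acc = mean(Accuracy), .groups = "drop") |>
  pivot_wider(names_from = Half, values_from = Acc)

# Correlate the two halves (uncorrected for attenuation)
cor(split_half$odd, split_half$even, use = "complete.obs")
```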
```r
# split into halves and create Pronoun effect vars for each half
exp3_d_reliability <- exp3_load_data_acc() |>
  select(ParticipantID, Pronoun, Accuracy) |>
  arrange(ParticipantID, Pronoun) |>  # sort by pronoun within participant
  mutate(Obs_Num = seq(1, length(Pronoun))) |>
  mutate(Obs_Half = case_when(  # count odd and even trials
    Obs_Num %% 2 == 0 ~ "even",
    Obs_Num %% 2 == 1 ~ "odd"
  )) |>
  mutate(
    Pronoun_Even = case_when(  # effect of pronoun just in even trials
      Obs_Half == "even" & Pronoun == "T_HS" ~ -0.66,
      Obs_Half == "even" & Pronoun != "T_HS" ~ +0.33,
      Obs_Half == "odd" ~ 0
    ),
    Pronoun_Odd = case_when(  # effect of pronoun just in odd trials
      Obs_Half == "odd" & Pronoun == "T_HS" ~ -0.66,
      Obs_Half == "odd" & Pronoun != "T_HS" ~ +0.33,
      Obs_Half == "even" ~ 0
    )
  )

# run Bayesian model
exp3_m_reliability <- brm(
  formula = Accuracy ~ Pronoun_Even + Pronoun_Odd +  # fixed effects for halves
    (1 + Pronoun_Even + Pronoun_Odd | ParticipantID),  # random slopes by subj
  data = exp3_d_reliability,
  family = bernoulli(),  # keep default priors
  seed = 4, cores = 4,
  chains = 4, iter = 4000,
  file = "r_data/exp3_reliability"  # won't rerun because results are copied in
)
exp3_m_reliability

# tidy results
exp3_r_reliability <- exp3_m_reliability |>
  tidy() |>
  filter(str_detect(term, "Even") & str_detect(term, "Odd")) |>
  select(estimate, std.error, conf.low, conf.high) |>
  mutate(across(everything(), ~ format(., digits = 2, nsmall = 2)))
exp3_r_reliability
```
Multilevel model
The data are nested within participants and crossed with items, so we fit a logistic mixed-effects model with crossed random effects.
Pronoun is coded with orthogonal Helmert contrasts (the 1st contrast compares they/them to he/him + she/her; the 2nd contrast is not relevant here). Nametag and Introduction are mean-centered effects coded. Demographic/survey variables are all mean-centered.
The maximal model that converged included only random intercepts.
Stepwise regression tests whether adding the demographic, language experience, and language attitude variables significantly improves model fit over the hypothesis-testing model.
Multilevel model
Using {lme4} for logistic mixed-effects regression modeling (Bates et al. 2015) and {buildmer} for stepwise model comparison (Voeten 2023)
Can run this locally, or on an Amazon EC2 instance to save time using {paws}
```r
# Run in parallel with 6 clusters
# Won't work when running as background job, but otherwise much faster
cl6 <- makeCluster(6)                   # make 6 clusters, keep default type
clusterEvalQ(cl6, library(buildmer))    # load needed packages on each cluster
clusterExport(cl6, "exp3_d_subjCov")    # load data on each cluster

exp3_m_subj_cov <- buildmer(
  formula = Accuracy ~ Pronoun * Nametag * Intro *  # allow all interactions
    Age_C * Familiarity_C * GenderBeliefs_C * LGBTQ_Fct +
    Rating_C * Sharing_C +
    (1 | ParticipantID) + (1 | Character),
  data = exp3_d_subjCov,
  family = binomial,
  buildmerControl = list(
    direction = c("order", "backward"),  # max then backwards elim (default)
    cl = cl6,
    args = list(control = glmerControl(optimizer = "bobyqa")),  # nlminbwrap had huge SE
    # require Pronoun * Nametag * Intro and both random intercepts,
    # aka keep the hypothesis-testing model
    include = "Pronoun * Nametag * Intro + (1 | ParticipantID) + (1 | Character)"
  )
)
stopCluster(cl6)
remove(cl6)
```
Model results
Intercept: More likely to produce the correct pronoun than not across all conditions (β = 13.16, z = 12.24, p < .001)
Pronoun: More accurate for he/him + she/her characters than for they/them characters (β = 5.05, z = 5.09, p < .001)
Pronoun × Nametag × Intro (β = 6.24, z = 3.91, p < .001)
–Nametag +Intro condition showed 95% accuracy for singular they
+Nametag +Intro and +Nametag –Intro conditions showed 91% accuracy
–Nametag –Intro condition showed 73% accuracy
Gender Beliefs: Participants who more strongly endorsed the gender binary and gender essentialism were less accurate overall (β = -10.07, z = -3.44, p < .001) and showed a larger relative difference in accuracy between they/them and he/him + she/her (β = 6.50, z = 3.53, p < .001)
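The per-condition percentages above come down to a simple grouped summary; a minimal sketch, assuming the `exp3_load_data_subj()` helper from the merge step:

```r
library(dplyr)

# Mean singular-they accuracy in each Nametag x Intro cell
exp3_load_data_subj() |>
  filter(Pronoun == "T_HS") |>            # they/them characters only
  group_by(Nametag, Intro) |>
  summarise(Accuracy = mean(Accuracy), .groups = "drop") |>
  arrange(desc(Accuracy))
```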
Bonus Slides: Study Results
References
Baayen, R. H., D. J. Davidson, and D. M. Bates. 2008. “Mixed-Effects Modeling with Crossed Random Effects for Subjects and Items.” Journal of Memory and Language 59 (4): 390–412. https://doi.org/10.1016/j.jml.2007.12.005.
Barr, Dale J., Roger Levy, Christoph Scheepers, and Harry J. Tily. 2013. “Random Effects Structure for Confirmatory Hypothesis Testing: Keep It Maximal.” Journal of Memory and Language 68 (3): 255–78. https://doi.org/10.1016/j.jml.2012.11.001.
Bates, D. M., Martin Mächler, Benjamin M. Bolker, and Steven C. Walker. 2015. “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.
Bürkner, Paul-Christian. 2017. “brms: An R Package for Bayesian Multilevel Models Using Stan.” Journal of Statistical Software 80 (1): 1–28. https://doi.org/10.18637/jss.v080.i01.
Green, P., and C. J. MacLeod. 2016. “simr: An R Package for Power Analysis of Generalized Linear Mixed Models by Simulation.” Methods in Ecology and Evolution 7 (4): 493–98. https://doi.org/10.1111/2041-210X.12504.
Hedge, Craig, Georgina Powell, and Petroc Sumner. 2017. “The Reliability Paradox: Why Robust Cognitive Tasks Do Not Produce Reliable Individual Differences.” Behavior Research Methods 50 (3): 1166–86. https://doi.org/10.3758/s13428-017-0935-1.
Radford, Alec, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. “Robust Speech Recognition via Large-Scale Weak Supervision.” https://doi.org/10.48550/ARXIV.2212.04356.
Staub, Adrian. 2021. “How Reliable Are Individual Differences in Eye Movements in Reading?” Journal of Memory and Language 116: 104190. https://doi.org/10.1016/j.jml.2020.104190.
Voeten, Cesko C. 2023. “buildmer: Stepwise Elimination and Term Reordering for Mixed-Effects Regression.” https://CRAN.R-project.org/package=buildmer.